An Automated Procedure to Identify Biomedical Articles that Contain Cancer-associated Gene Variants

Abstract

The proliferation of biomedical literature makes it increasingly difficult for researchers to find and manage relevant information. However, identifying research articles containing mutation data, a requisite first step in integrating large and complex mutation data sets, is currently tedious, time-consuming and imprecise. More effective mechanisms for identifying articles containing mutation information would be beneficial both for the curation of mutation databases and for individual researchers. We developed an automated method that uses information extraction, classifier, and relevance ranking techniques to determine the likelihood of MEDLINE abstracts containing information regarding genomic variation data suitable for inclusion in mutation databases. We targeted the CDKN2A (p16) gene and the procedure for document identification currently used by CDKN2A Database curators as a measure of feasibility. A set of abstracts was manually identified from a MEDLINE search as potentially containing specific CDKN2A mutation events. A subset of these abstracts was used as a training set for a maximum entropy classifier to identify text features distinguishing "relevant" from "not relevant" abstracts. Each document was represented as a set of indicative word, word pair, and entity tagger-derived genomic variation features. When applied to a test set of 200 candidate abstracts, the classifier predicted 88 articles as being relevant; of these, 29 of 32 manuscripts in which manual curation found CDKN2A sequence variants were positively predicted. Thus, the set of potentially useful articles that a manual curator would have to review was reduced by 56%, maintaining 91% recall (sensitivity) and more than doubling precision (positive predictive value).
Subsequent expansion of the training set to 494 articles yielded similar precision and recall rates, and comparison of the original and expanded trials demonstrated that the average precision improved with the larger data set. Our results show that automated systems can effectively identify article subsets relevant to a given task and may prove to be powerful tools for the broader research community. This procedure can be readily adapted to any or all genes, organisms, or sets of documents.
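
The figures reported for the 200-abstract test set can be checked with a short calculation. The sketch below is illustrative only and is not the authors' code: the function name `curation_metrics` and its parameter names are hypothetical, and the baseline precision assumes that a curator reviewing all 200 candidates would find the same 32 relevant manuscripts.

```python
def curation_metrics(total, predicted_relevant, true_relevant, true_positives):
    """Recompute screening metrics from the counts given in the abstract.

    All names here are illustrative assumptions, not taken from the paper.
    """
    # Recall (sensitivity): share of truly relevant abstracts the classifier kept.
    recall = true_positives / true_relevant
    # Precision (positive predictive value) among abstracts flagged as relevant.
    precision = true_positives / predicted_relevant
    # Assumed baseline: precision if the curator reads every candidate abstract.
    baseline_precision = true_relevant / total
    # Fraction of candidate abstracts the curator no longer has to review.
    workload_reduction = 1 - predicted_relevant / total
    return {
        "recall": recall,
        "precision": precision,
        "baseline_precision": baseline_precision,
        "workload_reduction": workload_reduction,
    }


# Counts from the abstract: 200 candidates, 88 predicted relevant,
# 32 truly relevant, 29 of those correctly predicted.
metrics = curation_metrics(total=200, predicted_relevant=88,
                           true_relevant=32, true_positives=29)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the counts give a recall of 29/32 (about 91%), a workload reduction of 112/200 (56%), and a precision of 29/88 (about 33%) against an assumed baseline of 32/200 (16%), which is consistent with the statement that precision more than doubled.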